Abstract

Power laws pervade statistical physics and complex systems [2d], but, traditionally, researchers in these 
fields have paid little attention to fitting these distributions properly. Who has not seen (or even shown)
a log-log plot of a completely curved line pretending to be a power law? Recently, Clauset et al. have 
proposed a method to decide if a set of values of a variable has a distribution whose tail is a power 
law [3]. The key to their procedure is the identification of the minimum value of the variable for which
the fit holds, which is selected as the value for which the Kolmogorov-Smirnov distance between the 
empirical distribution and its maximum-likelihood fit is minimum. However, it has been shown that 
this method can reject the power-law hypothesis even in the case of power-law simulated data 0]. Here 
we propose a simpler selection criterion, which is illustrated with the more involved case of discrete
power-law distributions.

1 Procedure 

This method is similar in spirit to the one by Clauset et al. [3,4], but with important differences [5]. Here
we just present the recipe; the justification is available in Ref. [5].

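Consider a discrete power-law distribution, defined for $n = a, a+1, a+2, \dots, \infty$ (with $a$ a natural number),

$$f(n) = \mathrm{Prob}[\text{variable} = n] = \frac{1}{\zeta(\beta+1, a)\, n^{\beta+1}},$$

with $\beta > 0$ and $\zeta$ the Hurwitz zeta function [3] (the Riemann zeta function for $a = 1$),

$$\zeta(\gamma, a) = \sum_{k=0}^{\infty} \frac{1}{(a+k)^{\gamma}}.$$

The complementary cumulative distribution is then

$$S(n) = \mathrm{Prob}[\text{variable} \ge n] = \frac{\zeta(\beta+1, n)}{\zeta(\beta+1, a)}.$$

Note then that f(n) is a power law but S(n) is not (only asymptotically).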
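As a rough illustration of these definitions, the following Python sketch evaluates f(n) and S(n) numerically. The function names (`hurwitz_zeta`, `pmf`, `ccdf`) are illustrative choices, S(n) is assumed to be the complementary cumulative distribution Prob[variable >= n], and the zeta function is approximated with a shortened Euler-Maclaurin tail correction, which is a simplification made here for brevity.

```python
def hurwitz_zeta(gamma, a, M=14):
    """Approximate zeta(gamma, a) = sum_{k>=0} (a + k)**(-gamma), for gamma > 1.

    The first M terms are summed directly; the tail is replaced by the
    integral, the half-term and the first Bernoulli correction (B_2 = 1/6)
    of the Euler-Maclaurin formula.
    """
    x = a + M
    head = sum((a + k) ** -gamma for k in range(M))
    tail = x ** (1.0 - gamma) / (gamma - 1.0) + 0.5 * x ** -gamma
    correction = gamma * x ** (-gamma - 1.0) / 12.0
    return head + tail + correction


def pmf(n, beta, a):
    """f(n) = Prob[variable = n], for n >= a."""
    return n ** -(beta + 1.0) / hurwitz_zeta(beta + 1.0, a)


def ccdf(n, beta, a):
    """S(n) = Prob[variable >= n] (assumed meaning of S), for n >= a."""
    return hurwitz_zeta(beta + 1.0, n) / hurwitz_zeta(beta + 1.0, a)


if __name__ == "__main__":
    beta, a = 1.0, 1
    for n in (1, 2, 10, 100):
        # f(n) falls as an exact power of n; S(n) does so only for large n.
        print(n, pmf(n, beta, a), ccdf(n, beta, a))
```

Printing both columns for increasing n is a quick way to see that only f(n) is an exact power law, while S(n) approaches one asymptotically.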



For fixed $a$, the data values verifying $n \ge a$ are numbered from $i = 1$ to $N_a$, and the remainder is removed.
Then, the method consists of the following steps:

1. Maximum likelihood estimation of the exponent $\beta$.
Calculate the log-likelihood function,

$$\ell(\beta) = \frac{1}{N_a} \sum_{i=1}^{N_a} \ln f(n_i) = -\ln \zeta(\beta+1, a) - (\beta+1) \ln G_a,$$

with $G_a$ the geometric mean of the data in the range, $\ln G_a = N_a^{-1} \sum_{i=1}^{N_a} \ln n_i$.

Calculate the maximum of $\ell(\beta)$ (for instance through the downhill simplex method [7]),

$$\beta_{\mathrm{emp}} = \arg\max_{\forall \beta} \ell(\beta),$$

which has an error (standard deviation [5]) 

$$\sigma_{\mathrm{emp}} \simeq \frac{\beta_{\mathrm{emp}}}{\sqrt{N_a}}.$$


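The computation of the zeta function uses the Euler-Maclaurin formula

$$\sum_{k=0}^{\infty} f(k) = \sum_{k=0}^{M-1} f(k) + \int_M^{\infty} f(k)\,dk + \frac{f(M)}{2} - \sum_{k=1}^{\infty} \frac{B_{2k}}{(2k)!} f^{(2k-1)}(M),$$

where $B_{2k}$ are the Bernoulli numbers ($B_2 = 1/6$, $B_4 = -1/30$, $B_6 = 1/42$, $B_8 = -1/30$, ...). So, with $f(k) = (a+k)^{-\gamma}$ and $\gamma = \beta + 1$,

$$\zeta(\gamma, a) = \sum_{k=0}^{M-1} \frac{1}{(a+k)^{\gamma}} + \frac{1}{(\gamma-1)(a+M)^{\gamma-1}} + \frac{1}{2(a+M)^{\gamma}} + \sum_{k=1}^{P} B_{2k} C_{2k-1}(M),$$

with

$$C_{2k-1}(M) = \frac{(\gamma+2k-2)(\gamma+2k-3)}{2k(2k-1)(a+M)^2}\, C_{2k-3}(M) \quad \text{and} \quad C_1(M) = \frac{\gamma}{2(a+M)^{\gamma+1}}.$$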

The second sum in the formula runs from $k = 1$ to a fixed $P$, taken as $P = 18$, unless a minimum-value
term $|B_{2k} C_{2k-1}(M)|$ is reached, in which case the sum is stopped; this ensures better
convergence [5]. We also take $M = 14$.
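Step 1 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function names are hypothetical, the Bernoulli numbers are generated from their standard recurrence, the "minimum term" rule is interpreted as stopping the correction sum as soon as a term stops decreasing in absolute value, and a golden-section search is swapped in for the downhill simplex method mentioned in the text (the log-likelihood depends on a single parameter, so a one-dimensional search suffices).

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, log


@lru_cache(maxsize=None)
def bernoulli_even(P):
    """Return (B_2, B_4, ..., B_2P) as floats, from the standard recurrence."""
    B = [Fraction(1)]
    for m in range(1, 2 * P + 1):
        B.append(-sum(comb(m + 1, j) * B[j] for j in range(m)) / (m + 1))
    return tuple(float(B[2 * k]) for k in range(1, P + 1))


def hurwitz_zeta(gamma, a, M=14, P=18):
    """Euler-Maclaurin approximation of zeta(gamma, a), for gamma > 1."""
    x = a + M
    total = sum((a + k) ** -gamma for k in range(M))
    total += x ** (1.0 - gamma) / (gamma - 1.0) + 0.5 * x ** -gamma
    C = gamma / (2.0 * x ** (gamma + 1.0))  # C_1(M)
    smallest = float("inf")
    for k, B2k in enumerate(bernoulli_even(P), start=1):
        if k > 1:  # recurrence giving C_{2k-1}(M) from C_{2k-3}(M)
            C *= (gamma + 2 * k - 2) * (gamma + 2 * k - 3) / (2 * k * (2 * k - 1) * x * x)
        term = B2k * C
        if abs(term) >= smallest:  # assumed stopping rule: terms no longer shrink
            break
        total += term
        smallest = abs(term)
    return total


def log_geometric_mean(data, a):
    """ln G_a: mean of ln n over the data values with n >= a."""
    tail = [n for n in data if n >= a]
    return sum(log(n) for n in tail) / len(tail)


def log_likelihood(beta, ln_G, a):
    """Per-datum log-likelihood: -ln zeta(beta+1, a) - (beta+1) ln G_a."""
    return -log(hurwitz_zeta(beta + 1.0, a)) - (beta + 1.0) * ln_G


def fit_beta(data, a, lo=1e-3, hi=20.0, tol=1e-8):
    """Maximize the log-likelihood over beta in [lo, hi] by golden-section search."""
    ln_G = log_geometric_mean(data, a)
    r = (5 ** 0.5 - 1) / 2
    while hi - lo > tol:
        c, d = hi - r * (hi - lo), lo + r * (hi - lo)
        if log_likelihood(c, ln_G, a) > log_likelihood(d, ln_G, a):
            hi = d  # maximum lies in [lo, d]
        else:
            lo = c  # maximum lies in [c, hi]
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    data = [1] * 600 + [2] * 150 + [3] * 70 + [4] * 40 + [5] * 25 + [10] * 5
    print(fit_beta(data, a=1))
```

The search bounds `lo` and `hi` are arbitrary defaults of this sketch; they should bracket the expected exponent.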

Once we obtain $\beta_{\mathrm{emp}}$, how do we know if the fit is good or bad?
2. Calculation of the Kolmogorov-Smirnov statistic [7],

$$d_{\mathrm{emp}} = \max_{\forall n \ge a} \left| \frac{N_n}{N_a} - S(n; \beta_{\mathrm{emp}}) \right|,$$

with $N_n$ the number of data taking values larger than or equal to $n$. The maximization is performed for
all values of $n \ge a$, integer and non-integer.

Large and small values of $d_{\mathrm{emp}}$ denote respectively bad and good fits. But what is large and small?
This is determined in Step 3.
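A simplified sketch of the Kolmogorov-Smirnov distance is given below. It is an illustration rather than the exact procedure of the text: the maximization is restricted to integer n (the non-integer values mentioned above are not treated), S(n) is assumed to be Prob[variable >= n], and `hurwitz_zeta` and `ks_distance` are hypothetical helper names, with the zeta function approximated by a shortened Euler-Maclaurin correction.

```python
from collections import Counter


def hurwitz_zeta(gamma, a, M=14):
    """Short Euler-Maclaurin approximation of zeta(gamma, a), for gamma > 1."""
    x = a + M
    head = sum((a + k) ** -gamma for k in range(M))
    tail = x ** (1.0 - gamma) / (gamma - 1.0) + 0.5 * x ** -gamma
    return head + tail + gamma * x ** (-gamma - 1.0) / 12.0


def ks_distance(data, a, beta):
    """max over integer n >= a of |N_n / N_a - S(n; beta)|."""
    tail = [n for n in data if n >= a]
    N_a = len(tail)
    counts = Counter(tail)
    z = hurwitz_zeta(beta + 1.0, a)
    S = 1.0          # S(a) = 1
    remaining = N_a  # N_n at n = a
    d = 0.0
    # One step past the largest datum, where N_n = 0 and S(n) keeps decreasing.
    for n in range(a, max(tail) + 2):
        d = max(d, abs(remaining / N_a - S))
        remaining -= counts.get(n, 0)      # N_{n+1} = N_n - (number of data equal to n)
        S -= n ** -(beta + 1.0) / z        # S(n+1) = S(n) - f(n)
    return d


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    data = [1] * 600 + [2] * 150 + [3] * 70 + [4] * 40 + [5] * 25 + [10] * 5
    print(ks_distance(data, a=1, beta=1.4))
```

Updating S(n) through S(n+1) = S(n) - f(n) means the zeta function is evaluated only once per call.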
3. Simulation of the discrete power-law distribution, with exponent $\beta_{\mathrm{emp}}$ and $n \ge a$.

We use a generalization of the rejection method of Ref. [10]:

(a) Generate a uniform random number $u$ between 0 and $u_{\max}$, with $a = 1/u_{\max}^{1/\beta_{\mathrm{emp}}}$.

(b) Obtain a new random number

$$y = \mathrm{int}\left(1/u^{1/\beta_{\mathrm{emp}}}\right),$$

where int(x) means the integer part of x. Notice that its probability function is

$$q(y) = \left(\frac{a}{y}\right)^{\beta_{\mathrm{emp}}} - \left(\frac{a}{y+1}\right)^{\beta_{\mathrm{emp}}}.$$

(c) Accept $y$ as the simulated value if a new uniform random number $v$ (between 0 and 1) fulfills

$$v \le \frac{f(y)\,q(a)}{f(a)\,q(y)},$$

and reject $y$ otherwise. If accepted, take $n = y$.

Notice that the computation of the $\zeta$ function is not required.

Defining $\tau = (1 + y^{-1})^{\beta_{\mathrm{emp}}}$ and $b = (a+1)^{\beta_{\mathrm{emp}}}$, the acceptance condition becomes simpler,

$$\frac{v\,y\,(\tau - 1)}{b - a^{\beta_{\mathrm{emp}}}} \le \frac{a\,\tau}{b}.$$

(d) Repeat the process until $N_a$ values of $n = y$ are obtained.
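Sub-steps (a) to (d) can be sketched in Python as below. The function name and the optional `rng` argument are illustrative assumptions, and the guard against y falling below a only protects against floating-point rounding at the upper end of the uniform range; it is not part of the recipe itself.

```python
import random


def sample_discrete_power_law(beta, a, size, rng=None):
    """Draw `size` integers n >= a with Prob[n] proportional to n**-(beta + 1)."""
    rng = rng or random.Random()
    u_max = a ** -beta             # equivalent to a = 1 / u_max**(1/beta)
    b = (a + 1) ** beta
    denom = b - a ** beta
    out = []
    while len(out) < size:
        u = u_max * (1.0 - rng.random())   # (a) uniform in (0, u_max]
        y = int(u ** (-1.0 / beta))        # (b) integer part of 1 / u**(1/beta)
        if y < a:                          # rounding guard (assumption of this sketch)
            y = a
        tau = (1.0 + 1.0 / y) ** beta
        v = rng.random()                   # (c) uniform in [0, 1)
        if v * y * (tau - 1.0) / denom <= a * tau / b:
            out.append(y)                  # accepted: n = y
    return out                             # (d) exactly `size` accepted values


if __name__ == "__main__":
    rng = random.Random(2012)  # fixed seed only for reproducibility of the demo
    print(sample_discrete_power_law(1.5, 1, 10, rng))
```

Very small exponents can make `u ** (-1.0 / beta)` overflow, so in practice the sketch is meant for values of the exponent that are not close to zero.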

4. Apply step 1 (maximum likelihood estimation) to the simulated data. 
Call the obtained exponent $\beta_{\mathrm{sim}}$.

5. Apply step 2 (calculation of the Kolmogorov-Smirnov statistic) to the simulated data, using the fit
obtained in step 4, as

$$d_{\mathrm{sim}} = \max_{\forall n \ge a} \left| \frac{N_{\mathrm{sim}}(n)}{N_a} - S(n; \beta_{\mathrm{sim}}) \right|,$$

with $N_{\mathrm{sim}}(n)$ the number of simulated data taking values larger than or equal to $n$.

6. Comparison of the two statistics $d_{\mathrm{emp}}$ and $d_{\mathrm{sim}}$ is not enough, so:

Repeat steps 3, 4, and 5 a large enough number of times (e.g., 100 or more, as allowed by computational
resources), in order to get an ensemble of values of $d_{\mathrm{sim}}$.

7. Compute the p-value as

$$p = \frac{\text{number of simulations with } d_{\mathrm{sim}} \ge d_{\mathrm{emp}}}{\text{number of simulations}}.$$

The error of the p-value comes from that of a binomial distribution,

$$\sigma_p = \sqrt{\frac{p(1-p)}{\text{number of simulations}}}.$$

Low values of $p$, like $p < 0.05$, are considered bad fits.

For higher values, $p > 0.05$, the power-law fit with $\beta_{\mathrm{emp}}$ cannot be rejected.
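The last step reduces to a simple count over the simulated distances. The sketch below assumes that the empirical distance and the list of simulated distances have already been computed; `p_value` is a hypothetical name, and simulations are counted when their distance is at least as large as the empirical one.

```python
def p_value(d_emp, d_sims):
    """Return (p, error) from an empirical KS distance and simulated ones.

    p is the fraction of simulations with d_sim >= d_emp; the error is the
    binomial standard deviation sqrt(p (1 - p) / number of simulations).
    """
    n_sims = len(d_sims)
    p = sum(d >= d_emp for d in d_sims) / n_sims
    error = (p * (1.0 - p) / n_sims) ** 0.5
    return p, error


if __name__ == "__main__":
    # Invented distances, for illustration only.
    print(p_value(0.05, [0.01, 0.06, 0.07, 0.02]))
```

With 100 simulations the binomial error is at most 0.05, which gives a feel for how many repetitions are needed to resolve a given threshold.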
Repeating the whole procedure for "all" values of $a$ we obtain a set of acceptable pairs of $a$ and $\beta_{\mathrm{emp}}$.
Select the one that gives the smallest value of $a$ provided that $p$ is above 0.20 (for instance). In a formula,

$$a^* = \min\{a \text{ such that } p > 0.20\},$$

which has the associated exponent $\beta^*_{\mathrm{emp}}$.

Note that the final p-value of the procedure is not the one obtained for fixed $a$, but this is not relevant
for providing a good fit (as long as the latter is larger than, say, 0.20).
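The selection rule amounts to a filtered minimum. In the sketch below, `p_by_a` is a hypothetical dictionary mapping each tried value of a to the p-value obtained for it, and the 0.20 threshold is the illustrative value used in the text.

```python
def select_a_star(p_by_a, threshold=0.20):
    """Smallest a whose p-value exceeds the threshold, or None if there is none."""
    acceptable = [a for a, p in p_by_a.items() if p > threshold]
    return min(acceptable) if acceptable else None


if __name__ == "__main__":
    # Invented p-values, for illustration only.
    print(select_a_star({1: 0.03, 2: 0.15, 3: 0.31, 5: 0.62}))
```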

The figures illustrate the results for $n$ = word frequencies in the Finnish novel Seitsemän veljestä by
Aleksis Kivi, for which $a^* = 1$ and $\beta^*_{\mathrm{emp}} = 1.13 \pm 0.01$, with $N_{a^*} = 22035$ and $8.1 \times 10^4$ word tokens.
Notice that f(n) is a power law but S(n) is not, but both are representations of a power-law distribution.

We thank R. D. Malmgren (for discussions), L. Devroye (for his book), G. Boleda (for many things!),
and the attendees of the 2012 meeting of the network complexitat.cat (for their interest).